Tag SNP Selection Based on Multivariate Linear Regression

Authors

  • Jingwu He
  • Alex Zelikovsky
Abstract

The search for associations between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the SNPs. Tagging serves two purposes: it achieves budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs, and it compacts extremely long SNP sequences (obtained, e.g., from the Affymetrix Map Array) for further fine genotype analysis. Tagging should first choose tags from the SNPs under consideration and then, knowing the values of the chosen tag SNPs, predict (or statistically cover) the non-tag SNPs. In this paper we propose a new SNP prediction method based on rounding of multivariate linear regression (MLR) analysis in sigma-restricted coding. When predicting a non-tag SNP, the MLR method accumulates information about all tag SNPs, resulting in significantly higher prediction accuracy with the same number of tags than previously known tagging methods. We also show that tag selection strongly depends on how the chosen tags will be used: the advantage of one tag set over another can only be considered with respect to a certain prediction method. Two simple universal tag selection methods have been applied: a (faster) stepwise and a (slower) local-minimization tag selection algorithm. An extensive experimental study on various datasets, including 6 regions from HapMap, shows that MLR prediction combined with stepwise tag selection uses significantly fewer tags (e.g., up to two times fewer tags to reach 90% prediction accuracy) than the state-of-the-art methods of Halperin et al. [9] for genotypes and Halldorsson et al. [8] for haplotypes, respectively. Our stepwise tagging matches the quality of STAMPA [9] while being faster. The code is publicly available at http://alla.cs.gsu.edu/~software.
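The two ideas in the abstract, predicting a non-tag SNP by rounding a linear regression over all tag SNPs and choosing tags greedily, can be illustrated with a minimal sketch. This is not the authors' released code. It assumes genotypes are already coded as -1, 0, 1 (one plausible reading of sigma-restricted coding), scores a tag set by prediction accuracy on the training data itself, and adds a tiny ridge term purely for numerical stability. All function names (`solve`, `fit_mlr`, `predict`, `accuracy`, `stepwise_tags`) are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(rows, y, ridge=1e-6):
    """Least-squares weights (intercept first) via the normal equations.

    The ridge term is an assumption of this sketch; it only keeps the
    system non-singular when tag columns are collinear.
    """
    xs = [[1.0] + [float(v) for v in r] for r in rows]
    d = len(xs[0])
    a = [[sum(x[i] * x[j] for x in xs) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * t for x, t in zip(xs, y)) for i in range(d)]
    return solve(a, b)


def predict(w, row):
    """Round the regression output to the nearest value in {-1, 0, 1}."""
    v = w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))
    return max(-1, min(1, round(v)))


def accuracy(data, tags):
    """Fraction of non-tag SNP values recovered from the tag SNPs (training data)."""
    non_tags = [j for j in range(len(data[0])) if j not in tags]
    if not non_tags:
        return 1.0
    rows = [[ind[t] for t in tags] for ind in data]
    hits = 0
    for j in non_tags:
        y = [ind[j] for ind in data]
        w = fit_mlr(rows, y)
        hits += sum(predict(w, r) == t for r, t in zip(rows, y))
    return hits / (len(non_tags) * len(data))


def stepwise_tags(data, k):
    """Greedy stepwise selection: repeatedly add the SNP that most improves accuracy."""
    tags = []
    for _ in range(k):
        best, best_acc = None, -1.0
        for c in range(len(data[0])):
            if c in tags:
                continue
            acc = accuracy(data, tags + [c])
            if acc > best_acc:  # strict comparison keeps the first of tied candidates
                best, best_acc = c, acc
        tags.append(best)
    return tags


# Toy data: 6 individuals, 3 SNPs; SNP 1 copies SNP 0 and SNP 2 is its negation.
data = [
    [-1, -1, 1],
    [0, 0, 0],
    [1, 1, -1],
    [1, 1, -1],
    [-1, -1, 1],
    [0, 0, 0],
]
tags = stepwise_tags(data, 1)
```

On this toy input a single tag is enough, because every other SNP is a linear function of it. A faithful evaluation would measure accuracy on held-out individuals instead of the training set.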


Similar resources

MLR-tagging: informative SNP selection for unphased genotypes based on multiple linear regression

The search for the association between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs accurately representing the rest of the SNPs. Informative SNP selection can achieve (1) considerable budget savings by genotyping only a limited number of SN...


Linear reduction method for predictive and informative tag SNP selection

Constructing a complete human haplotype map is helpful when associating complex diseases with their related SNPs. Unfortunately, the number of SNPs is very large and it is costly to sequence many individuals. Therefore, it is desirable to reduce the number of SNPs that should be sequenced to a small number of informative representatives called tag SNPs. In this paper, we propose a new linear al...


A new model of multi-marker correlation for genome-wide tag SNP selection.

Tag SNP selection is an important problem in computational biology and genetics because a small set of tag SNP markers may help reduce the cost of genotyping and thus genome-wide association studies. Several methods for selecting a smallest possible set of tag SNPs based on different formulations of tag SNP selection (block-based or genome-wide) and mathematical models of marker correlation hav...


Smooth-Threshold Multivariate Genetic Prediction with Unbiased Model Selection.

We develop a new genetic prediction method, smooth-threshold multivariate genetic prediction, using single nucleotide polymorphisms (SNPs) data in genome-wide association studies (GWASs). Our method consists of two stages. At the first stage, unlike the usual discontinuous SNP screening as used in the gene score method, our method continuously screens SNPs based on the output from standard univ...


Genome-wide selection of tag SNPs using multiple-marker correlation

Motivation: The tag SNP approach is a valuable tool in whole genome association studies, and a variety of algorithms have been proposed to identify the optimal tag SNP set. Currently, most tag SNP selection is based on two-marker (pairwise) linkage disequilibrium (LD). Recent literature has shown that multiple-marker LD also contains useful information that can further increase the genetic cove...




Publication date: 2006